We introduce an information theoretic measure of statistical structure,
called 'binding information', for sets of random variables, and compare
it with several previously proposed measures including excess entropy,
Bialek et al.'s predictive information, and the multi-information. We
derive some of the properties of the binding information, particularly
in relation to the multi-information, and show that, for finite sets of
binary random variables, the processes which maximise binding
information are the 'parity' processes. Finally we discuss some of the
implications this has for the use of the binding information as a
measure of complexity.
I. INTRODUCTION

The concepts of 'structure', 'pattern' and 'complexity' 
are relevant in many fields of inquiry: physics, biology, 
cognitive sciences, machine learning, the arts and so on; 
but are vague enough to resist being quantified in a single 
definitive manner. One approach, which we adopt here, is 
to attempt to characterise them in statistical terms, for 
distributions over configurations of some system, using 
the tools of information theory [1].

In this letter, we propose a measure of statistical structure based on
the concept of predictive information rate (PIR) [2], which measures an
aspect of temporal dependency not captured by previously proposed
measures. We review a number of these earlier proposals and the PIR,
and then define the binding information as the extensive counterpart of
the PIR applicable to arbitrary countable sets of random variables.
After describing some of its properties, we identify some finite
discrete processes that maximise the binding information.

In the following, if $X$ is a random process indexed by a set $A$, and
$B \subseteq A$, then $X_B$ denotes the compound random variable
(random 'vector') formed by taking $X_a$ for each $a \in B$. The set of
integers from $M$ to $N$ inclusive will be written $M..N$, and
$\setminus$ will denote the set difference operator, so, for example,
$X_{1..3 \setminus \{2\}} = (X_1, X_3)$.



II. BACKGROUND 

Suppose that $(\ldots, X_{-1}, X_0, X_1, \ldots)$ is a bi-infinite
stationary sequence of random variables, and that $\forall t \in \mathbb{Z}$,
the random variable $X_t$ takes values in a discrete set $\mathcal{X}$.
Let $\mu$ be the associated shift-invariant probability measure.
Stationarity implies that the probability distribution associated with
any contiguous block of $N$ variables $(X_{t+1}, \ldots, X_{t+N})$ is
independent of $t$, and therefore we can define a shift-invariant block
entropy function:

$H(N) \triangleq H(X_1, \ldots, X_N) = \sum_{x \in \mathcal{X}^N} -p_\mu^N(x) \log p_\mu^N(x)$, (1)

where $p_\mu^N : \mathcal{X}^N \to [0, 1]$ is the unique probability mass
function for any $N$ consecutive variables in the sequence,
$p_\mu^N(x) \triangleq \Pr(X_1 = x_1 \wedge \ldots \wedge X_N = x_N)$.
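As a hedged illustration of the block entropy function, the following sketch enumerates every length-$N$ block of a stationary Markov chain and evaluates $H(N)$ directly. The two-state chain, its transition probabilities, and the helper names `block_entropy` and `conditional_entropy` are assumptions introduced here for illustration only. For a first-order Markov chain the increment $H(N) - H(N-1)$ equals the one-step conditional entropy for every $N \geq 2$, which is one standard way of expressing the entropy rate.

```python
import itertools
import math


def block_entropy(trans, pi, n):
    """H(N) in bits for N consecutive symbols of a stationary Markov chain.

    trans[s][t] is the transition probability s -> t and pi is the
    stationary distribution (both are illustrative inputs).
    """
    if n == 0:
        return 0.0
    h = 0.0
    for block in itertools.product(range(len(pi)), repeat=n):
        p = pi[block[0]]
        for s, t in zip(block, block[1:]):
            p *= trans[s][t]
        if p > 0.0:
            h -= p * math.log2(p)
    return h


def conditional_entropy(trans, pi):
    """H(X_1 | X_0) in bits, the entropy rate of a first-order chain."""
    h = 0.0
    for s, row in enumerate(trans):
        for p in row:
            if p > 0.0:
                h -= pi[s] * p * math.log2(p)
    return h


# Hypothetical two-state chain: P(0 -> 1) = a, P(1 -> 0) = b.
a, b = 0.2, 0.4
trans = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]  # stationary distribution of this chain

if __name__ == "__main__":
    for n in range(1, 7):
        h_n = block_entropy(trans, pi, n)
        increment = h_n - block_entropy(trans, pi, n - 1)
        print(n, round(h_n, 6), round(increment, 6))
```

Exhaustive enumeration costs $|\mathcal{X}|^N$ evaluations, so this is only practical for short blocks and small alphabets.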



The entropy rate has two equivalent definitions in
terms of the block entropy function [1, Ch. 4]:

$h_\mu \triangleq \lim_{N \to \infty} \frac{H(N)}{N} = \lim_{N \to \infty} H(N) - H(N-1)$. (2)

The block entropy function can also be used to express
the mutual information between two contiguous segments 
of the sequence of length N and M respectively: 

$I(X_{-N..-1}; X_{0..M-1}) = H(N) + H(M) - H(N+M)$. (3)

If we let both block lengths $N$ and $M$ tend to infinity, we obtain
what has been called the excess entropy [3] or the effective measure
complexity [4]. It is the amount of information about the infinite
future that can be obtained, on average, by observing the infinite past:



$E = \lim_{N \to \infty} 2H(N) - H(2N)$. (4)



Bialek et al. [5] defined the predictive information
$I_{\mathrm{pred}}(N)$ as the mutual information between a block of
length $N$ and the infinite future following it:



$I_{\mathrm{pred}}(N) = \lim_{M \to \infty} H(N) + H(M) - H(N+M)$. (5)



They showed that even if $I_{\mathrm{pred}}(N)$ diverges as $N$ tends to
infinity, the manner of its divergence reveals something about the
learnability of the underlying random process. Bialek et al. also
emphasised that $I_{\mathrm{pred}}(N)$ is the sub-extensive component of
the entropy: if $N h_\mu$ is the purely extensive (i.e., linear in $N$)
component of the entropy, then $I_{\mathrm{pred}}(N)$ is the difference
between the block entropy $H(N)$ and its extensive component:



$H(N) = N h_\mu + I_{\mathrm{pred}}(N)$. (6)



The multi-information [6] is defined for any collection
of $N$ random variables $(X_1, \ldots, X_N)$ as

$I(X_{1..N}) \triangleq -H(X_{1..N}) + \sum_{i \in 1..N} H(X_i)$. (7)

(c) predictive information rate $b_\mu$

FIG. 1. Venn diagram representation [1, Ch. 2] of several information
measures for stationary random processes. Each circle or oval
represents a random variable or sequence of random variables relative
to time $t = 0$. Overlapped areas correspond to various mutual
informations. In (c), the circle represents the 'present'. Its total
area is $H(X_0) = H(1) = \rho_\mu + r_\mu + b_\mu$, where $\rho_\mu$ is
the multi-information rate, $r_\mu$ is the residual entropy rate, and
$b_\mu$ is the predictive information rate. The entropy rate is
$h_\mu = r_\mu + b_\mu$.

For $N = 2$, the multi-information reduces to the mutual information
$I(X_1; X_2)$, while for $N > 2$, $I(X_{1..N})$ continues to be a
measure of dependence, being zero if and only if the variables are
statistically independent. In the thermodynamic limit, the intensive
multi-information rate (cf. Dubnov's information rate [7]) can be
defined as



$\rho_\mu \triangleq \lim_{N \to \infty} I(X_{1..N}) - I(X_{1..N-1})$. (8)



It can easily be shown that $\rho_\mu = I_{\mathrm{pred}}(1) = H(1) - h_\mu$.
Erb and Ay [8] studied this quantity (they call it $I$) and
showed that, in the present terminology,



$I(X_{1..N}) + I_{\mathrm{pred}}(N) = N \rho_\mu$. (9)



Comparing this with (6), we see that $I_{\mathrm{pred}}(N)$ is also
the sub-extensive component of the multi-information.
Thus, all of the measures considered so far, being linearly
dependent in various ways, are closely related.
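The linear relations among these quantities can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a hypothetical two-state stationary Markov chain, for which a finite future of length `m` already gives the limiting value of the predictive information, and the names `block_entropy` and `measures` are introduced here only for the example.

```python
import itertools
import math


def block_entropy(trans, pi, n):
    """H(N) in bits for N consecutive symbols of a stationary Markov chain."""
    if n == 0:
        return 0.0
    h = 0.0
    for block in itertools.product(range(len(pi)), repeat=n):
        p = pi[block[0]]
        for s, t in zip(block, block[1:]):
            p *= trans[s][t]
        if p > 0.0:
            h -= p * math.log2(p)
    return h


def measures(trans, pi, n, m=4):
    """Return (multi-information, predictive information, N * rho) in bits.

    Assumes a stationary first-order Markov chain, so a future of finite
    length m stands in for the infinite future.
    """
    def H(k):
        return block_entropy(trans, pi, k)

    h_rate = H(m + 1) - H(m)          # entropy rate, exact for a Markov chain
    multi = n * H(1) - H(n)           # sum of marginal entropies minus joint
    i_pred = H(n) + H(m) - H(n + m)   # block of length n versus length-m future
    rho = H(1) - h_rate               # multi-information rate
    return multi, i_pred, n * rho


# Hypothetical chain: P(0 -> 1) = 0.2, P(1 -> 0) = 0.4.
trans = [[0.8, 0.2], [0.4, 0.6]]
pi = [2.0 / 3.0, 1.0 / 3.0]

if __name__ == "__main__":
    for n in range(1, 5):
        multi, i_pred, n_rho = measures(trans, pi, n)
        print(n, round(multi, 6), round(i_pred, 6), round(n_rho, 6))
```

For processes with longer memory the finite-`m` value is only an approximation to the limit, so the identity would hold approximately rather than to machine precision.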

Another class of measures, including Grassberger's true measure
complexity [4] and Crutchfield et al.'s statistical complexity
[9], [10], is based on the properties of stochastic automata that
model the process under consideration. These have some interesting
properties but are beyond the scope of this letter.

In [2], we introduced the predictive information rate (PIR), which is
the average information in one observation about the infinite future
given the infinite past. If
$\overleftarrow{X}_t = (\ldots, X_{t-2}, X_{t-1})$ denotes the variables
before time $t$, and $\overrightarrow{X}_t = (X_{t+1}, X_{t+2}, \ldots)$
denotes those after $t$, the PIR is defined as a conditional mutual
information:

$\mathcal{I}_t = I(X_t; \overrightarrow{X}_t | \overleftarrow{X}_t) = H(\overrightarrow{X}_t | \overleftarrow{X}_t) - H(\overrightarrow{X}_t | X_t, \overleftarrow{X}_t)$. (10)



Equation (10) can be read as the average reduction in uncertainty
about the future on learning $X_t$, given the past. Due to the symmetry
of the mutual information, it can also be written as
$\mathcal{I}_t = H(X_t | \overleftarrow{X}_t) - H(X_t | \overrightarrow{X}_t, \overleftarrow{X}_t)$.
$H(X_t | \overleftarrow{X}_t)$ is the entropy rate $h_\mu$, but
$H(X_t | \overleftarrow{X}_t, \overrightarrow{X}_t)$ is a quantity that
does not appear to have been considered by other authors yet. It is the
conditional entropy of one variable given all the others in the
sequence, future as well as past. We call this the residual entropy
rate $r_\mu$, and define it as a limit:
$r_\mu \triangleq \lim_{N \to \infty} H(X_{-N..N}) - H(X_{-N..-1}, X_{1..N})$. (11)



The second term, $H(X_{-N..-1}, X_{1..N})$, is the joint entropy
of two non-adjacent blocks with a gap between them,
and cannot be expressed as a function of block entropies
alone. If we let $b_\mu$ denote the shift-invariant PIR, then
$b_\mu = h_\mu - r_\mu$ (see Fig. 1).
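To make the three rates concrete, the following sketch evaluates the entropy rate, the residual entropy rate and the PIR for a stationary first-order Markov chain. It rests on an assumption that is not part of the general definition: for such a chain, one step of past and one step of future carry all the relevant conditioning, so the limit reduces to entropies of the triple $(X_{-1}, X_0, X_1)$. The function names and the example chains are hypothetical.

```python
import itertools
import math


def entropy(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0.0)


def marginal(pmf, keep):
    """Marginal pmf over the tuple positions listed in keep."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def triple_pmf(trans, pi):
    """Joint pmf of (X_{-1}, X_0, X_1) for a stationary Markov chain."""
    k = len(pi)
    return {(s, t, u): pi[s] * trans[s][t] * trans[t][u]
            for s, t, u in itertools.product(range(k), repeat=3)}


def rates(trans, pi):
    """Return (h, r, b): entropy rate, residual entropy rate, PIR in bits.

    Assumes first-order Markov structure, so a window of three consecutive
    variables is sufficient.
    """
    joint = triple_pmf(trans, pi)
    h = entropy(marginal(joint, [0, 1])) - entropy(marginal(joint, [0]))
    r = entropy(joint) - entropy(marginal(joint, [0, 2]))
    return h, r, h - r


if __name__ == "__main__":
    # Hypothetical 'sticky' chain that stays in its state with probability 0.9.
    print(rates([[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5]))
```

Both extremes give a PIR of zero in this sketch: independent fair bits have $h_\mu = r_\mu = 1$ bit, and a deterministic alternation has $h_\mu = r_\mu = 0$.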

Many of the measures reviewed above were intended 
as measures of 'complexity', a quality that is somewhat 
open to interpretation [IJ, [Tj]. It is generally agreed, 
however, that complexity should be low for systems that are
deterministic or easy to compute or predict ('ordered') and low for
systems that are completely random and unpredictable ('disordered').
The PIR satisfies these
conditions without being 'over-universal' in the sense of 
Crutchfield et al. [12], [13]: it is not simply a function of
entropy or entropy rate that fails to distinguish between 
the different strengths of temporal dependency that can 
be exhibited by systems at a given level of entropy. In our 
analysis of Markov chains [2], we found that processes
which maximise the PIR do not maximise the multi-information rate
$\rho_\mu$ (or the excess entropy, which is the same in this case), but
do have a certain kind of partial predictability that requires the
observer continually to pay attention to the most recent observations
in order to make optimal predictions. And so, while Crutchfield et al.
make a compelling case for the excess entropy $E$ and their statistical
complexity $C_\mu$ as measures of complexity, there is still room to
suggest that the PIR captures a different and non-trivial aspect of
temporal dependency structure not previously examined.



III. BINDING INFORMATION 

If the PIR is accumulated over successive time steps, a quantity which
we call the binding information is obtained. To proceed, we first
reformulate the infinite sequence PIR (10) so that it becomes
applicable to a finite sequence of random variables $(X_1, \ldots, X_N)$:

$\mathcal{I}_t(X_{1..N}) = I(X_t; X_{(t+1)..N} | X_{1..(t-1)})$. (12)

Note that this is no longer shift-invariant and may depend on $t$.
The binding information, then, is the sum

$B(X_{1..N}) = \sum_{t \in 1..N} \mathcal{I}_t(X_{1..N})$. (13)
(a) $H(X_{1..4})$ (b) $I(X_{1..4})$ (c) $B(X_{1..4})$

FIG. 2. Illustration of binding information as compared with
multi-information for a set of four random variables. In each
case, the quantity is represented by the total amount of black
ink, as it were, in the shaded parts of the diagram. Whereas
the multi-information counts the multiply-overlapped areas
multiple times, the binding information counts each overlapped
area just once.

Expanding this in terms of entropies and conditional entropies and
cancelling terms yields

$B(X_{1..N}) = H(X_{1..N}) - \sum_{t \in 1..N} H(X_t | X_{1..N \setminus \{t\}})$. (14)

Like the multi-information, it measures dependencies between random
variables, but in a different way (see Fig. 2). Though the binding
information was derived by accumulating the PIR sequentially, the
result is permutation invariant, suggesting that the concept might be
applicable to arbitrary sets of random variables regardless of their
topology. Accordingly, we define the binding information as follows:

Definition 1. If $\{X_a | a \in A\}$ is a set of random variables
indexed by a countable set $A$, the binding information is

$B(X_A) \triangleq H(X_A) - \sum_{a \in A} H(X_a | X_{A \setminus \{a\}})$. (15)

Since the binding information can be expressed as a sum of
(conditional) mutual informations between sets of random variables
(13), it is (a) non-negative and (b) invariant to invertible pointwise
transformations of the variables; that is, if $Y_A$ is a set of random
variables such that, $\forall a \in A$, $Y_a = f_a(X_a)$ for some
invertible functions $f_a$, then $B(Y_A) = B(X_A)$.

The binding information is zero for sets of independent random
variables (the case of complete 'disorder') and zero when all
variables have zero entropy, taking known values and representing a
certain kind of 'order'. However, it is also possible to obtain low
binding information for random systems which are nonetheless very
ordered in a certain way. If each variable $X_a$ is some function of
$X_{a'}$ for all $a' \neq a$, then the state of the entire system can
be read off from any one of its component variables. In this case, it
is easy to show that $B(X_A) = H(X_A) = H(X_a)$ for any $a \in A$,
which, as we will see, is relatively low compared with what is possible
as soon as the number of variables $N$ becomes appreciably large. Thus,
binding information is low for both highly 'ordered' and highly
'disordered' systems, but in this case, 'highly ordered' does not
simply mean deterministic or known a priori: it means the whole is
predictable from the smallest of its parts.
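The cases just described can be reproduced with a direct evaluation of the definition of binding information. The sketch below is illustrative rather than authoritative: joint distributions are represented as dictionaries from configurations to probabilities, and every function name is an assumption made for this example. The parity process mentioned in the abstract is included to show a value that is much larger than in the 'ordered' and 'disordered' cases.

```python
import itertools
import math


def entropy(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0.0)


def marginal(pmf, keep):
    """Marginal pmf over the variable positions listed in keep."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def binding_information(pmf, n):
    """Joint entropy minus the sum of H(X_a | all other variables), in bits."""
    h_joint = entropy(pmf)
    residual = 0.0
    for a in range(n):
        rest = [i for i in range(n) if i != a]
        # H(X_a | rest) = H(joint) - H(rest)
        residual += h_joint - entropy(marginal(pmf, rest))
    return h_joint - residual


def independent_bits(n):
    """n independent fair bits: complete 'disorder'."""
    return {x: 2.0 ** -n for x in itertools.product((0, 1), repeat=n)}


def known_state(n):
    """All variables fixed at zero: no entropy at all."""
    return {(0,) * n: 1.0}


def copies(n):
    """Every variable is a copy of one fair bit."""
    return {(0,) * n: 0.5, (1,) * n: 0.5}


def parity_process(n, m=0):
    """Uniform distribution over binary configurations whose sum mod 2 is m."""
    configs = [x for x in itertools.product((0, 1), repeat=n) if sum(x) % 2 == m]
    return {x: 1.0 / len(configs) for x in configs}


if __name__ == "__main__":
    n = 4
    for name, pmf in [("independent", independent_bits(n)),
                      ("known state", known_state(n)),
                      ("copies", copies(n)),
                      ("parity", parity_process(n))]:
        print(name, round(binding_information(pmf, n), 6))
```

The dictionary representation keeps the example free of dependencies; for larger systems an array-based representation would be a more natural design choice.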




FIG. 3. Constraints on multi-information $I(X_{1..N})$ and binding
information $B(X_{1..N})$ for a system of $N = 6$ binary random
variables. The labelled points represent identifiable distributions
over the $2^N$ states that this system can occupy: (a) known state,
the system is deterministically in one configuration; (b) giant bit,
one of the $P^N_B$ processes; (c) parity, the parity processes
$P^N_{2,0}$ or $P^N_{2,1}$; (d) independent, the system of independent
unbiased random bits.



IV. BOUNDS ON BINDING AND 
MULTI-INFORMATION 

In this section we confine our attention to sets of discrete random
variables taking values in a common alphabet containing $K$ symbols.
In this case, it is quite straightforward to derive upper bounds, as
functions of the joint entropy, on both the multi-information and the
binding information, and also upper bounds on multi-information and
binding information as functions of each other. In [14], we prove the
following results:

Theorem 1. If $\{X_a | a \in A\}$ is a set of $N = |A|$ random
variables all taking values in a discrete set of cardinality
$K$, then the following constraints all hold:

$I(X_A) \leq N \log K - H(X_A)$ (16)

$I(X_A) \leq (N-1) H(X_A)$ (17)

$B(X_A) \leq H(X_A)$ (18)

$B(X_A) \leq (N-1)(N \log K - H(X_A))$. (19)

Also, $B(X_A)$ and $I(X_A)$ are mutually constrained:

$I(X_A) + B(X_A) \leq N \log K$. (20)

These bounds restrict $I(X_A)$ and $B(X_A)$ to two triangular regions
of the plane when plotted against the joint entropy $H(X_A)$ and are
illustrated for $N = 6$, $K = 2$ in Fig. 3. Two more linear bounds were
suggested by empirical computations of binding information and
multi-information:

$I(X_A) \leq (N-1) B(X_A)$ (21)
and $B(X_A) \leq (N-1) I(X_A)$. (22)

We have not found a general proof of these inequalities for all $N$,
but we have constructed a numerical algorithm [14] that is able to find
proofs for given values of $N$ up to 37, at which point insufficient
numerical precision becomes the limiting factor.
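The bounds of Theorem 1 can be spot-checked on randomly drawn distributions. The sketch below is a sanity check under assumptions, not the numerical proof algorithm of [14]: the sampling scheme, the tolerance and all function names are hypothetical choices made for illustration. A second helper evaluates the two empirically suggested bounds (21) and (22) in the same way, without implying anything about their general validity.

```python
import itertools
import math
import random


def entropy(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0.0)


def marginal(pmf, keep):
    """Marginal pmf over the variable positions listed in keep."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def info_measures(pmf, n):
    """Return (H, I, B) in bits for a joint pmf over n variables."""
    h = entropy(pmf)
    multi = sum(entropy(marginal(pmf, [i])) for i in range(n)) - h
    rest_sum = sum(entropy(marginal(pmf, [j for j in range(n) if j != i]))
                   for i in range(n))
    binding = rest_sum - (n - 1) * h  # same as H - sum_a H(X_a | rest)
    return h, multi, binding


def random_pmf(n, k, rng):
    """A random strictly positive pmf over the k**n configurations."""
    weights = {x: rng.random() + 1e-3
               for x in itertools.product(range(k), repeat=n)}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def theorem1_holds(pmf, n, k, tol=1e-9):
    """Check inequalities (16) to (20) with entropies measured in bits."""
    h, multi, binding = info_measures(pmf, n)
    cap = n * math.log2(k)
    return all([
        multi <= cap - h + tol,                # (16)
        multi <= (n - 1) * h + tol,            # (17)
        binding <= h + tol,                    # (18)
        binding <= (n - 1) * (cap - h) + tol,  # (19)
        multi + binding <= cap + tol,          # (20)
    ])


def suggested_bounds_hold(pmf, n, tol=1e-9):
    """Check the empirically suggested inequalities (21) and (22)."""
    _, multi, binding = info_measures(pmf, n)
    return multi <= (n - 1) * binding + tol and binding <= (n - 1) * multi + tol


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [random_pmf(3, 2, rng) for _ in range(100)]
    print(all(theorem1_holds(p, 3, 2) for p in samples))
    print(sum(suggested_bounds_hold(p, 3) for p in samples), "of", len(samples))
```

Random sampling of this kind can only fail to find a counterexample; it does not establish any of the inequalities.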
V. MAXIMISING BINDING INFORMATION

Is the absolute maximum of $B(X_{1..N}) = (N-1) \log K$ implied by
Theorem 1 (bounds (18) and (19) meet at $H(X_A) = (N-1) \log K$)
attainable, and by what kinds of processes? In [14] we prove the
following:

Theorem 2. If $\{X_1, \ldots, X_N\}$ is a set of discrete random
variables each taking values in $0..(K-1)$, then $B(X_{1..N})$ is
maximised at $(N-1) \log_2 K$ bits by the $K$ 'modulo-$K$ processes'
$P^N_{K,m}$ for $m \in 0..(K-1)$, under which the probability of a
configuration $x \in (0..K-1)^N$ is

$P^N_{K,m}(x) = \begin{cases} K^{1-N} & \text{if } \left(\sum_{i=1}^{N} x_i\right) \bmod K = m, \\ 0 & \text{otherwise.} \end{cases}$ (23)

When $K = 2$ (binary random variables) the maximal binding information
of $N - 1$ bits is reached by the two 'parity' processes: $P^N_{2,0}$
is the 'even' process, which distributes uniform probability over all
configurations with even parity; $P^N_{2,1}$ is the 'odd' process,
which distributes uniform probability over the complementary set. The
multi-information of the parity processes is 1 bit. By contrast, the
binary processes which maximise the multi-information at $N - 1$ bits
are the 'giant bit' processes: the indices $1..N$ are partitioned into
two sets $B$ and its complement $\bar{B} = 1..N \setminus B$, and
probabilities assigned to configurations $x \in \{0, 1\}^N$ as follows:

$P^N_B(x) = \begin{cases} 1/2 & \text{if } \forall i \in 1..N,\ x_i = \mathbb{I}(i \in B), \\ 1/2 & \text{if } \forall i \in 1..N,\ x_i = \mathbb{I}(i \in \bar{B}), \\ 0 & \text{otherwise,} \end{cases}$ (24)

where $\mathbb{I}(\cdot)$ is 1 if its argument is true and 0 otherwise.
The binding information of these processes is 1 bit. Thus we see that
the processes which maximise the binding information and the
multi-information are quite different in character.

VI. DISCUSSION AND CONCLUSIONS

As noted in §II, Bialek et al. argue that the predictive information
$I_{\mathrm{pred}}(N)$, being the sub-extensive component of the
entropy, is the unique measure of complexity that satisfies certain
reasonable desiderata, including transformation invariance for
continuous-valued variables [5, §5.3]. While lack of space precludes a
full discussion, we note that transformation invariance does not, as
Bialek et al. state [5, p. 2450], demand sub-extensivity: binding
information is transformation invariant, since it is a sum of
conditional mutual informations, and yet it can have an extensive
component, since its intensive counterpart, the PIR, can have a
well-defined value, e.g.,
in stationary Markov chains [2].

Measures of statistical dependency are discussed by Studeny and
Vejnarova [6, §4], who formulate a 'level-specific' measure that
captures the dependency visible when fixed size subsets of variables
are examined in isolation. Studeny and Vejnarova [6, p. 277] use the
parity process as an example of a random process in which the
dependence is only visible at the highest level, that is, amongst all
$N$ variables; if fewer than $N$ variables are examined, they appear to
be independent. They note that such processes were called
'pseudo-independent' by Xiang et al., who concluded that standard
algorithms for Bayesian network construction fail when applied to them.
It is intriguing, then, that these are singled out as 'most complex'
according to the binding information criterion.
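The pseudo-independence of the parity process is easy to confirm by enumeration. The following sketch is a simplified stand-in, not Studeny and Vejnarova's level-specific measure: it merely reports, for each subset size, the largest multi-information found among subsets of that size. The function names and the choice of $N = 4$ are assumptions made for the example.

```python
import itertools
import math


def entropy(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0.0)


def marginal(pmf, keep):
    """Marginal pmf over the variable positions listed in keep."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def multi_information(pmf, idx):
    """Multi-information in bits among the variables at positions idx."""
    joint = entropy(marginal(pmf, idx))
    return sum(entropy(marginal(pmf, [i])) for i in idx) - joint


def parity_process(n, m=0):
    """Uniform distribution over binary configurations whose sum mod 2 is m."""
    configs = [x for x in itertools.product((0, 1), repeat=n) if sum(x) % 2 == m]
    return {x: 1.0 / len(configs) for x in configs}


def dependence_by_level(pmf, n):
    """Largest multi-information among subsets of each size from 2 to n."""
    return {size: max(multi_information(pmf, list(idx))
                      for idx in itertools.combinations(range(n), size))
            for size in range(2, n + 1)}


if __name__ == "__main__":
    for size, value in dependence_by_level(parity_process(4), 4).items():
        print(size, round(value, 6))
```

In this example every subset of fewer than four variables shows no dependence, while the full set carries one bit of multi-information.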

To summarise, we have introduced binding information as a measure of
statistical structure that can be applied to any countable set of
random variables regardless of any topological organisation of the
variables. Binding information is maximised in finite discrete-valued
systems by the 'modulo-$K$ processes'. Further results on binding
information, and investigations of binding information in some
specific random processes, are presented in [14].